Abstract



In this paper, we develop a framework for information theoretic learning based on
infinitely divisible matrices. We formulate an entropy-like functional on positive
definite matrices based on Rényi's axiomatic definition of entropy and examine
some key properties of this functional that lead to the concept of infinite divisibility.
The proposed formulation avoids the plug-in estimation of density and brings
along the representation power of reproducing kernel Hilbert spaces. As an application
example, we derive a supervised metric learning algorithm using a matrix-based
analogue to conditional entropy, achieving results comparable with the state
of the art.

1 Introduction

Information theoretic quantities are descriptors of the distributions of the data that go beyond
second-order statistics. The expressive richness of quantities such as entropy or mutual information
has been shown to be very useful for machine learning problems where optimality based on linear and
Gaussian assumptions no longer holds. Nevertheless, operational quantities in information theory are
based on the probability laws underlying the data generation process, which are rarely known in
the statistical learning setting where the only information available comes from the sample
$\{z_i\}_{i=1}^n$. Therefore, the use of information theoretic quantities as descriptors of data
requires the development of suitable estimators. In [1], the use of Rényi's definition of entropy
along with Parzen density estimation is proposed as the main tool for information theoretic learning
(ITL). The optimality criterion is expressed in terms of quantities such as Rényi's entropy,
divergences based on the Cauchy-Schwarz inequality, and quadratic mutual information, among others.
Part of the research effort in this context has pointed out connections to reproducing kernel
Hilbert spaces [2]. Here, we show that these connections are not only valuable from a theoretical
point of view, but they can also be exploited to derive novel information theoretic quantities with
suitable estimators from data.

Positive definite kernels have been employed in machine learning as a representational tool allowing
algorithms that are based on inner products to be expressed in a rather generic way (the so-called
"kernel trick"). Algorithms that exploit this property are commonly known as kernel methods. Let
$\mathcal{X}$ be a nonempty set. A function $\kappa : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$
is called a positive definite kernel if for any finite set $\{x_i\}_{i=1}^N \subset \mathcal{X}$ and
any set of coefficients $\{\alpha_i\}_{i=1}^N \subset \mathbb{R}$ it follows that
$\sum_{i,j} \alpha_i \alpha_j \kappa(x_i, x_j) > 0$ if $\alpha_i \neq 0$ for at least one $i$. In
this case, there exists an implicit mapping $\phi : \mathcal{X} \to \mathcal{H}$ that maps any
element $x \in \mathcal{X}$ to an element $\phi(x)$ in a Hilbert space $\mathcal{H}$, such that
$\kappa(x, y) = \langle \phi(x), \phi(y) \rangle$. The above map provides an implicit representation
of the objects of interest that belong to the set $\mathcal{X}$. The generality of this
representation has been exploited in many practical applications, even for data that do not come in
the standard vector representation $\mathbb{R}^d$ [3]. This is possible as long as a kernel
function is available.
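As a minimal numerical sketch of the definition above, the following builds a Gram matrix and evaluates the quadratic form $\sum_{i,j} \alpha_i \alpha_j \kappa(x_i, x_j)$. The Gaussian kernel, the kernel size, the random data, and the helper name `gaussian_gram` are assumptions chosen for illustration, not part of the original text.

```python
import numpy as np

def gaussian_gram(X, sigma=1.0):
    """Gram matrix with entries exp(-||x_i - x_j||^2 / (2 sigma^2))."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X @ X.T)
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))   # 20 hypothetical samples in R^3
K = gaussian_gram(X)

# The double sum over alpha_i alpha_j kappa(x_i, x_j) is the quadratic form a^T K a.
coeffs = rng.normal(size=20)
quad_form = coeffs @ K @ coeffs
min_eig = np.linalg.eigvalsh(K).min()
print(f"quadratic form: {quad_form:.4f}, smallest eigenvalue: {min_eig:.2e}")
```

Checking the smallest eigenvalue of the Gram matrix is equivalent to checking the quadratic form over all coefficient vectors at once.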



More recently, it has been noticed that kernel induced maps are also useful beyond the above kernel
trick in a rather interesting fashion. Namely, kernels can be utilized to compute higher-order
statistics of the data in a nonparametric setting. Some examples exploring this idea are: kernel
independent component analysis [4], the work on measures of dependence and independence using
Hilbert-Schmidt norms [5], and the quadratic measures of independence proposed in [6]. It is not
surprising, yet important to mention, that a similar observation has also been reached from the work
on ITL, since one of the original motivations for using information theoretic quantities is to go
beyond second-order statistics. The work we introduce in this paper goes along these lines. The
twist is that, rather than defining an estimator of a conventional information theoretic quantity
such as Shannon entropy, we propose a quantity built from the data that satisfies axiomatic
properties similar to those of well-established definitions such as Rényi's definition of entropy.

The main contribution of this work is to show that the Gram matrix obtained from evaluating a 
positive definite kernel on samples can be used to define a quantity based on the data with properties 
similar to those of an entropy without assuming that the probability density is being estimated. 
Therefore, we look at the axiomatic treatment of entropy and adapt it to the Gram matrices describing 
the data. In this sense, we think about entropy as a measure inversely related to the amount of 
statistical regularities (structure) directly from the data that can be applied as the optimality criterion 
in a learning algorithm. As an application example, we derive a supervised metric learning algorithm
that uses conditional entropy as the cost function. This is the second contribution of this paper, and 
the empirical results show that the proposed method is competitive with current approaches. 
The main body of the paper is organized in two parts. First, we introduce the proposed matrix-based 
entropy measure using the spectral theorem along with a set of axiomatic properties that our quantity 
must satisfy. Then, the notion of joint entropy is developed based on Hadamard products. We look 
at some basic inequalities of information and how they translate to the setting of positive definite 
matrices, which finally allow us to define an analogue to conditional entropies. In the development 
of these ideas, we find that the concept of infinitely divisible kernels arises and becomes key to our
purposes. We revisit some of the theory on infinitely divisible matrices to show how it links to
the proposed information theoretic framework. In the last part, we introduce an information
theoretic supervised metric learning algorithm. We show how the proposed analogue to conditional 
entropy is a suitable cost function leading naturally to a gradient descent procedure. Finally, we 
provide some conclusions and future directions. 



2 Positive Definite Matrices and Rényi's Entropy Axioms

Let us start with an informal observation that motivated our matrix-based entropy. In [1], the use
of Rényi's entropy is proposed as an alternative to the more commonly adopted definition of entropy
given by Shannon. In particular, it was found that Rényi's second-order entropy provides an
amenable quantity for practical purposes. An empirical plug-in estimator of Rényi's second-order
entropy based on the Parzen density estimator $\hat{f}(x) = \frac{1}{n}\sum_{i=1}^{n} \kappa_\sigma(x_i, x)$
can be obtained as follows:

$$-\log\left(\frac{1}{n^2}\sum_{i,j=1}^{n} h(x_i, x_j)\right), \qquad (1)$$

where $h(x, y) = \int_{\mathcal{X}} \kappa_\sigma(x, z)\,\kappa_\sigma(y, z)\,dz$. Note that since $h$ is a
positive definite kernel, there exists a mapping $\phi$ to a RKHS such that
$h(x, y) = \langle \phi(x), \phi(y) \rangle$, and the argument of the log in (1), called the
information potential, can be interpreted in this space as a norm:

$$\frac{1}{n^2}\sum_{i,j=1}^{n} h(x_i, x_j) = \left\langle \frac{1}{n}\sum_{i=1}^{n}\phi(x_i), \frac{1}{n}\sum_{j=1}^{n}\phi(x_j) \right\rangle = \left\| \frac{1}{n}\sum_{i=1}^{n}\phi(x_i) \right\|^2, \qquad (2)$$

with the limiting case given by $\|E[\phi(X)]\|^2$. Thus, we can think of this estimator as a statistic
computed on the representation space provided by the positive definite kernel $h$. Now, let us look
at the case where $\kappa_\sigma$ is the Gaussian kernel; if we construct the Gram matrix $K$ with
elements $K_{ij} = \kappa_\sigma(x_i, x_j)$, it is easy to verify that the estimator of Rényi's
second-order entropy based on (1) corresponds to:

$$\hat{H}_2(X) = -\log\left(\frac{1}{n^2}\,\mathrm{tr}(KK)\right). \qquad (3)$$



As we can see, the information potential estimator can be related to the norm of the Gram matrix $K$
defined as $\|K\|^2 = \mathrm{tr}(KK)$. From the above informal argument two important questions
arise. First, it seems natural to ask whether other functionals on Gram matrices allow information
theoretic interpretations and can be further utilized as objective functions in ITL. Secondly, even
though $h$ was originally derived from a convolution of Parzen windows, is there anything about the
implicit representation that allows us to interpret (2) in information theoretic terms?
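The two observations above can be checked numerically. The sketch below is illustrative only: it uses the linear kernel $h(x, y) = \langle x, y \rangle$, whose feature map is the identity, as an assumed stand-in so that the norm interpretation in (2) can be evaluated explicitly, and a Gaussian Gram matrix with an assumed unit kernel size for the trace form in (3).

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 2))   # hypothetical sample
n = X.shape[0]

# Eq. (2) with an explicit feature map phi(x) = x, i.e. h(x, y) = <x, y>.
H = X @ X.T
potential_pairwise = H.sum() / n ** 2                     # (1/n^2) sum_ij h(x_i, x_j)
potential_norm = np.linalg.norm(X.mean(axis=0)) ** 2      # || (1/n) sum_i phi(x_i) ||^2

# Eq. (3): Gaussian Gram matrix (sigma = 1 is an assumption) and its trace form.
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / 2.0)
trace_form = np.trace(K @ K) / n ** 2
elementwise_form = np.sum(K ** 2) / n ** 2                # tr(KK) = sum_ij K_ij^2 for symmetric K
H2_estimate = -np.log(trace_form)
print(potential_pairwise, potential_norm, H2_estimate)
```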



2.1 Rényi's Axioms for Gram Matrices

Real Hermitian matrices are considered generalizations of real numbers. It is possible to define a
partial ordering on this set by using positive definite matrices, which are a generalization of the
positive real numbers. Let $M_n$ be the set of all $n \times n$ real matrices; for two Hermitian
matrices $A, B \in M_n$, we say $A \succeq B$ if $A - B$ is positive definite. Likewise, $A \succ B$
means that $A - B$ is strictly positive definite.

The following spectral decomposition theorem [7] relates to the functional calculus on matrices and
provides a reasonable way to extend continuous scalar-valued functions to Hermitian matrices.

Theorem 2.1 Let $D \subset \mathbb{C}$ be a given set and let
$\mathcal{N}_n(D) := \{A \in M_n : A \text{ is normal and } \sigma(A) \subset D\}$.
If $f(t)$ is a continuous scalar-valued function on $D$, then the primary matrix function

$$f(A) = U \begin{pmatrix} f(\lambda_1) & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & f(\lambda_n) \end{pmatrix} U^* \qquad (4)$$

is continuous on $\mathcal{N}_n(D)$, where $A = U \Lambda U^*$, $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$, and $U \in M_n$ is unitary.
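A short sketch of how a primary matrix function as in (4) can be evaluated for a real symmetric matrix. The helper name `matrix_function` and the test matrix are hypothetical; the sketch assumes the scalar function is defined on the spectrum of the input.

```python
import numpy as np

def matrix_function(A, f):
    """Evaluate f(A) = U diag(f(lambda_1), ..., f(lambda_n)) U^T for symmetric A."""
    lam, U = np.linalg.eigh(A)
    return (U * f(lam)) @ U.T   # scaling the columns of U equals U @ diag(f(lam))

rng = np.random.default_rng(2)
M = rng.normal(size=(4, 4))
A = M @ M.T + 4.0 * np.eye(4)   # symmetric positive definite test matrix

sqrt_A = matrix_function(A, np.sqrt)                 # f(t) = t^(1/2)
square_A = matrix_function(A, lambda t: t ** 2)      # f(t) = t^2
print(np.allclose(sqrt_A @ sqrt_A, A), np.allclose(square_A, A @ A))
```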

Equipped with the above result, we can define matrix functions such as $f(A) = A^r$ for
$r \in \mathbb{R}^+$, which will be used in defining the following matrix-based analogue to Rényi's
$\alpha$-entropy. The functional will then be applied to Gram matrices constructed by pairwise
evaluation of a positive definite kernel on the data samples.

Consider the set $\Delta_n^+$ of positive definite matrices $A \in M_n$ for which
$\mathrm{tr}(A) \leq 1$. It is clear that this set is closed under finite convex combinations.

Proposition 2.1 Let $A \in \Delta_n^+$ and $B \in \Delta_n^+$ and also $\mathrm{tr}(A) = \mathrm{tr}(B) = 1$. The functional

$$S_\alpha(A) = \frac{1}{1-\alpha}\log_2\left[\mathrm{tr}(A^\alpha)\right], \qquad (5)$$

satisfies the following set of conditions:

(i) $S_\alpha(PAP^*) = S_\alpha(A)$ for any orthonormal matrix $P \in M_n$.

(ii) $S_\alpha(pA)$ is a continuous function for $0 < p \leq 1$.

(iii) $S_\alpha(\frac{1}{n}I) = \log_2 n$.

(iv) $S_\alpha(A \otimes B) = S_\alpha(A) + S_\alpha(B)$.

(v) If $AB = BA = 0$, then for the strictly monotonic and continuous function $g(x) = 2^{(\alpha-1)x}$ for
$\alpha \neq 1$ and $\alpha > 0$, we have that:

$$S_\alpha(tA + (1-t)B) = g^{-1}\left(t\,g(S_\alpha(A)) + (1-t)\,g(S_\alpha(B))\right). \qquad (6)$$



Proof 2.1 The proof of (i) easily follows from Theorem 2.1. Take $A = U \Lambda U^*$; now $PU$ is also a
unitary matrix and thus $\mathrm{tr} f(A) = \mathrm{tr} f(PAP^*)$, because the trace functional is
invariant under unitary transformations. For (ii), the proof reduces to the continuity of
$\log_2(p^\alpha)$. For (iii), a simple calculation yields
$\mathrm{tr}\,A^\alpha = \left(\frac{1}{n}\right)^{\alpha-1}$. Now, for property (iv), notice that if
$\mathrm{tr}\,A = \mathrm{tr}\,B = 1$, then $\mathrm{tr}(A \otimes B) = 1$. Since $A = U \Lambda U^*$ and
$B = V \Gamma V^*$ we can write $A \otimes B = (U \otimes V)(\Lambda \otimes \Gamma)(U \otimes V)^*$, from
which $\mathrm{tr}(A \otimes B)^\alpha = \mathrm{tr}(\Lambda \otimes \Gamma)^\alpha = \mathrm{tr}(\Lambda^\alpha)\,\mathrm{tr}(\Gamma^\alpha)$
and thus (iv) is proved. Finally, for (v), notice that for any integer power $k$ of $tA + (1-t)B$ we
have $(tA + (1-t)B)^k = (tA)^k + ((1-t)B)^k$ since $AB = BA = 0$. Under extra conditions such as
$f(0) = 0$, the argument in the proof of Theorem 2.1 can be extended to this case. Since the
eigenspaces for the non-null eigenvalues of $A$ and $B$ are orthogonal, we can simultaneously
diagonalize $A$ and $B$ with the orthonormal matrix $U$, that is, $A = U \Lambda U^*$ and
$B = U \Gamma U^*$, where $\Lambda$ and $\Gamma$ are diagonal matrices containing the eigenvalues of $A$
and $B$ respectively. Since $AB = BA = 0$, then $\Lambda \Gamma = 0$. Under the extra condition
$f(0) = 0$, we have that $f(tA + (1-t)B) = f(tA) + f((1-t)B)$, yielding the desired result for (v).

Notice also that if the rank of $A$ is $\rho(A) = 1$, the entropy $S_\alpha(A) = 0$ for any $\alpha \neq 0$.
It is also true that

$$S_\alpha(A) \leq S_\alpha\left(\tfrac{1}{n}I\right) = \log_2 n. \qquad (7)$$

As we can see, (5) satisfies some properties attributed to entropy. Nevertheless, such a
characterization may not fully endow all unit-trace positive definite matrices with an information
theoretic interpretation. Which descriptors are suitable for representing joint spaces? What
properties should the matrices satisfy in order to be applied to concepts that link to random
variables, such as conditioning? In what follows, we address these points by developing notions of
joint entropy and conditional entropy, for which additional properties must be fulfilled. Recall
that the notions of joint and conditional entropy are not only important for the above, but they
also provide the means to propose objective functions for learning that are based on information
theoretic quantities.
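The functional (5) can be computed from the eigenvalues of a unit-trace matrix. The sketch below, with hypothetical helper names and random test matrices, checks properties (i), (iii) and (iv) of Proposition 2.1 numerically, together with the rank-one case and the bound (7). The order $\alpha = 1.5$ is an arbitrary choice.

```python
import numpy as np

def renyi_entropy(A, alpha=2.0):
    """S_alpha(A) = log2(tr(A^alpha)) / (1 - alpha) for symmetric PSD A with unit trace."""
    lam = np.linalg.eigvalsh(A)
    lam = np.clip(lam, 0.0, None)          # remove tiny negative round-off
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def random_unit_trace_psd(n, rng):
    M = rng.normal(size=(n, n))
    A = M @ M.T
    return A / np.trace(A)

rng = np.random.default_rng(3)
n, alpha = 4, 1.5
A = random_unit_trace_psd(n, rng)
B = random_unit_trace_psd(n, rng)

P, _ = np.linalg.qr(rng.normal(size=(n, n)))      # random orthonormal matrix
S_rotated = renyi_entropy(P @ A @ P.T, alpha)     # property (i)
S_uniform = renyi_entropy(np.eye(n) / n, alpha)   # property (iii)
S_kron = renyi_entropy(np.kron(A, B), alpha)      # property (iv)

v = rng.normal(size=n)
v /= np.linalg.norm(v)
S_rank_one = renyi_entropy(np.outer(v, v), alpha)  # rank-one, unit-trace matrix
print(S_rotated, S_uniform, S_kron, S_rank_one)
```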



2.2 Hadamard Products and the Notion of Joint Entropy 

Positive kernels are also useful in integrating multiple modalities. Using the product kernel,
we can readily define the notion of joint entropy. Consider a sequence of sample pairs
$\{(x_i, y_i)\}_{i=1}^n$ where $x_i \in \mathcal{X}$ and $y_i \in \mathcal{Y}$. Assume we have a
positive definite kernel $\kappa_1$ defined on $\mathcal{X} \times \mathcal{X}$ and a positive
definite kernel $\kappa_2$ defined on $\mathcal{Y} \times \mathcal{Y}$. The product kernel
$\kappa((x_i, y_i), (x_j, y_j)) = \kappa_1(x_i, x_j)\,\kappa_2(y_i, y_j)$ is a positive definite kernel
on $(\mathcal{X} \times \mathcal{Y}) \times (\mathcal{X} \times \mathcal{Y})$. As we can see, the
Hadamard product arises as a joint representation in our matrix-based entropy. Consider two
matrices $A$ and $B$ in $\Delta_n^+$ with unit trace, for which there exists some relation between
the elements $A_{ij}$ and $B_{ij}$ for all $i$ and $j$. The joint entropy can be defined as:

$$S_\alpha\left(\frac{A \circ B}{\mathrm{tr}(A \circ B)}\right). \qquad (8)$$

It is important then to verify that the definition of joint entropy (8) satisfies a basic intuition
about uncertainty. The joint entropy should never be smaller than any of the individual entropies
of the variables that compose it. The following proposition verifies this intuition for a subset of
the unit-trace positive definite matrices.

Proposition 2.2 Let $A$ and $B$ be two $n \times n$ positive definite matrices with trace 1 and
nonnegative entries, and $A_{ii} = \frac{1}{n}$ for $i = 1, 2, \ldots, n$. Then, the following
inequality holds:

$$S_\alpha\left(\frac{A \circ B}{\mathrm{tr}(A \circ B)}\right) \geq S_\alpha(B). \qquad (9)$$
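The joint entropy (8) and the inequality (9) can be illustrated with Gram matrices from two modalities. In this sketch the Gaussian kernel, the kernel size, the random data, and all helper names are assumptions; scaling a Gaussian Gram matrix by $1/n$ gives unit trace and diagonal entries $1/n$, as required by Proposition 2.2.

```python
import numpy as np

def renyi_entropy(A, alpha):
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def normalized_gaussian_gram(Z, sigma=1.0):
    """Gaussian Gram matrix scaled by 1/n, so that tr = 1 and every diagonal entry is 1/n."""
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2)) / Z.shape[0]

def joint_entropy(A, B, alpha):
    AB = A * B                       # Hadamard (entrywise) product
    return renyi_entropy(AB / np.trace(AB), alpha)

rng = np.random.default_rng(4)
n = 30
X = rng.normal(size=(n, 2))          # hypothetical first modality
Y = rng.normal(size=(n, 1))          # hypothetical second modality
A = normalized_gaussian_gram(X)
B = normalized_gaussian_gram(Y)

for alpha in (1.01, 2.0, 5.0):
    print(alpha, joint_entropy(A, B, alpha), renyi_entropy(A, alpha), renyi_entropy(B, alpha))
```

Because both matrices have diagonal entries $1/n$ here, the inequality can be checked against either marginal entropy.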



2.3 Conditional Entropy as a Difference Between Entropies 

The conditional entropy of $X$ given $Y$, which can be understood as the uncertainty about $X$ that
remains after knowing the joint distribution of $X$ and $Y$, can be obtained from a difference between
two entropies. In Shannon's definition of conditional entropy, $H(X|Y)$ can be expressed as
$H(X|Y) = H(X, Y) - H(Y)$. The properties of this definition have been recently studied in the case of
Rényi's entropies [8], and in the matrix case this definition yields:

$$S_\alpha(A|B) = S_\alpha\left(\frac{A \circ B}{\mathrm{tr}(A \circ B)}\right) - S_\alpha(B) \qquad (10)$$

for positive semidefinite matrices $A$ and $B$ with nonnegative entries and unit trace such that
$A_{ii} = \frac{1}{n}$ for all $i = 1, \ldots, n$. The above quantity is nonnegative and upper bounded
by $S_\alpha(A)$. Certainly, normalization is an important property of the matrices involved in the
above results. If $A$ and $B$ are normalized to have unit trace, then for $r \in [0, 1]$ it is true
that the Hadamard product

$$A^{\circ r} \circ B^{\circ (1-r)} \qquad (11)$$

is also normalized. However, it is not always true that the resulting matrix (11) is positive
definite. This product can be thought of as a weighted geometric average for which the resulting
matrix will give more emphasis to either one of the matrices. However, if $A$ and $B$ satisfy a
property called infinite divisibility, the product is guaranteed to be positive definite¹.
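A minimal sketch of the conditional entropy (10), under the same illustrative assumptions as before (Gaussian Gram matrices scaled by $1/n$, hypothetical helper names). It also evaluates one limiting case: when $A$ is the constant matrix with all entries $1/n$, which has rank one and zero entropy, the normalized Hadamard product reduces to $B$ and the conditional entropy vanishes.

```python
import numpy as np

def renyi_entropy(A, alpha):
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def normalized_gaussian_gram(Z, sigma=1.0):
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2)) / Z.shape[0]

def conditional_entropy(A, B, alpha):
    """S_alpha(A|B) = S_alpha(A o B / tr(A o B)) - S_alpha(B), as in Eq. (10)."""
    AB = A * B
    return renyi_entropy(AB / np.trace(AB), alpha) - renyi_entropy(B, alpha)

rng = np.random.default_rng(5)
n = 25
A = normalized_gaussian_gram(rng.normal(size=(n, 2)))
B = normalized_gaussian_gram(rng.normal(size=(n, 2)))
A_const = np.ones((n, n)) / n        # rank one, unit trace, diagonal 1/n

cond_general = conditional_entropy(A, B, alpha=2.0)
cond_const = conditional_entropy(A_const, B, alpha=2.0)
print(cond_general, cond_const, renyi_entropy(A, 2.0))
```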

3 Infinitely Divisible Functions 

The theory of infinitely divisible matrices developed below is not new, but it is included because
it provides a basic understanding of the role of infinitely divisible kernels in computing the above
information theoretic quantities from data. To avoid confusion, let us describe the key points to
bear in mind from this mathematical description. Infinitely divisible kernels and negative definite
functions are tied together through the exponential and logarithm functions. Both functions provide
Hilbert space representations of the data. We can think of the RKHS of the infinitely divisible
kernel as a representation to compute the higher-order descriptors of the data. On the other hand,
the Hilbertian metric can be the representation space for which we want to compute the higher-order
statistics. Normalization, as we show below, is not only important in satisfying the conditions for
the information theoretic quantities already defined, but it also shows that many possible
representational choices are equivalent.

3.1 Negative Definite Functions and Hilbertian Metrics 

Let $\mathcal{M} = (\mathcal{X}, d)$ be a separable metric space. A necessary and sufficient condition
for $\mathcal{M}$ to be embeddable in a Hilbert space $\mathcal{H}$ is that for any set
$\{x_i\} \subset \mathcal{X}$ of $n + 1$ points,
$\sum_{i,j=1}^{n} \alpha_i \alpha_j \left(d^2(x_0, x_i) + d^2(x_0, x_j) - d^2(x_i, x_j)\right) \geq 0$, for any
$\alpha \in \mathbb{R}^n$. This condition is equivalent to
$\sum_{i,j=0}^{n} \alpha_i \alpha_j d^2(x_i, x_j) \leq 0$, for any $\alpha \in \mathbb{R}^{n+1}$ such that
$\sum_{i=0}^{n} \alpha_i = 0$. This condition is known as negative definiteness. Interestingly, the
above condition implies that $\exp(-r\,d^2(x_i, x_j))$ is positive definite in $\mathcal{X}$ for all
$r > 0$ [9]. Indeed, matrices derived from functions satisfying the above property form a special
class of matrices known as infinitely divisible.
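The negative definiteness condition and its consequence can be checked numerically for the squared Euclidean distance, which is an assumed example of a Hilbertian metric; the data and the values of $r$ are arbitrary. For coefficients that sum to zero, the quadratic form equals $-2\|\sum_i \alpha_i x_i\|^2$ in this case.

```python
import numpy as np

rng = np.random.default_rng(6)
X = rng.normal(size=(15, 3))          # hypothetical points in R^3
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # squared Euclidean distances

# Coefficients constrained to sum to zero.
a = rng.normal(size=15)
a -= a.mean()
quad = a @ D2 @ a                      # should be <= 0 (negative definiteness)
closed_form = -2.0 * np.linalg.norm(X.T @ a) ** 2

# exp(-r d^2) should be positive (semi)definite for every r > 0.
min_eigs = [np.linalg.eigvalsh(np.exp(-r * D2)).min() for r in (0.1, 1.0, 10.0)]
print(quad, closed_form, min_eigs)
```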

3.2 Infinitely Divisible Matrices 

According to the Schur product theorem, $A \succeq 0$ implies
$A^{\circ n} = A \circ A \circ \cdots \circ A \succeq 0$ for any positive integer $n$. Does the above
hold if we take fractional powers of $A$? In other words, is the matrix
$A^{\circ \frac{1}{m}} \succeq 0$ for any positive integer $m$? This question leads to the concept of
infinitely divisible matrices [10, 11]. A nonnegative matrix $A$ is said to be infinitely divisible
if $A^{\circ r} \succeq 0$ for every nonnegative $r$. Infinitely divisible matrices are intimately
related to negative definiteness, as we can see from the following proposition.

Proposition 3.1 If $A$ is infinitely divisible, then the matrix $B_{ij} = -\log A_{ij}$ is negative definite.

From this fact it is possible to relate infinitely divisible matrices with isometric embedding into
Hilbert spaces. If we construct the matrix

$$D_{ij} = B_{ij} - \frac{1}{2}(B_{ii} + B_{jj}), \qquad (12)$$

using the matrix $B$ from Proposition 3.1, there exists a Hilbert space $\mathcal{H}$ and a mapping
$\phi$ such that

$$D_{ij} = \|\phi(i) - \phi(j)\|_{\mathcal{H}}^2. \qquad (13)$$

Moreover, notice that if $A$ is positive definite, $-A$ is negative definite and $\exp A_{ij}$ is
infinitely divisible. In a similar way, we can construct a matrix,

$$D_{ij} = -A_{ij} + \frac{1}{2}(A_{ii} + A_{jj}), \qquad (14)$$

with the same property (13). This relation between (12) and (14) suggests a normalization of
infinitely divisible matrices with non-zero diagonal elements that can be formalized in the
following theorem.

¹ By this, we also mean positive semidefinite.

Figure 1. Spaces involved in the infinitely divisible matrix framework

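Theorem 3.1 Let $\mathcal{X}$ be a nonempty set, and let $d_1$ and $d_2$ be two metrics on it, such that
for any set $\{x_i\}_{i=1}^n$, $\sum_{i,j=1}^{n} \alpha_i \alpha_j d_\ell(x_i, x_j) \leq 0$, for any
$\alpha \in \mathbb{R}^n$ with $\sum_{i=1}^{n} \alpha_i = 0$, is true for $\ell = 1, 2$. Consider the
matrices $A^{(\ell)}_{ij} = \exp -d_\ell(x_i, x_j)$ and their normalizations defined as:

$$\hat{A}^{(\ell)}_{ij} = \frac{A^{(\ell)}_{ij}}{\sqrt{A^{(\ell)}_{ii} A^{(\ell)}_{jj}}}. \qquad (15)$$

Then, if $\hat{A}^{(1)} = \hat{A}^{(2)}$ for any finite set $\{x_i\}_{i=1}^n \subset \mathcal{X}$, there
exist isometrically isomorphic Hilbert spaces $\mathcal{H}_1$ and $\mathcal{H}_2$ that contain the
Hilbert space embeddings of the metric spaces $(\mathcal{X}, d_\ell)$, $\ell = 1, 2$.
Moreover, $\hat{A}^{(\ell)}$ are infinitely divisible.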

Figure 1 summarizes the relation between spaces that are considered in the proposed framework.
The object space $\mathcal{X}$ can be directly mapped into $\mathcal{H}_\kappa$ using an infinitely
divisible kernel $\kappa$, or it can be mapped to a Hilbert space $\mathcal{H}_d$ if a negative
definite function $d$ is employed as the distance function. The spaces $\mathcal{H}_\kappa$ and
$\mathcal{H}_d$ are related by the log and exp functions.
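The notions of this section can be illustrated with a small numerical sketch. It checks that fractional Hadamard powers of a Gaussian Gram matrix stay positive semidefinite, shows a positive semidefinite matrix with nonnegative entries that is not infinitely divisible, and recovers the squared distances from the Gram matrix through the log map as in (12). The Gaussian kernel, the $3 \times 3$ counterexample, and the helper name are choices made here for illustration.

```python
import numpy as np

def min_eig_hadamard_power(A, r):
    """Smallest eigenvalue of the entrywise power A^{o r}."""
    return np.linalg.eigvalsh(A ** r).min()   # ** acts entrywise on NumPy arrays

rng = np.random.default_rng(7)
X = rng.normal(size=(10, 2))
D2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
G = np.exp(-D2)                                # Gaussian Gram matrix, infinitely divisible

# Positive semidefinite with nonnegative entries (eigenvalues 0, 1, 2),
# yet its entrywise square root has a negative eigenvalue.
c = 1.0 / np.sqrt(2.0)
T = np.array([[1.0, c, 0.0],
              [c, 1.0, c],
              [0.0, c, 1.0]])

# Eq. (12): distances from the negative definite matrix B_ij = -log G_ij.
B = -np.log(G)
D = B - 0.5 * (np.diag(B)[:, None] + np.diag(B)[None, :])

print([min_eig_hadamard_power(G, r) for r in (0.1, 0.5, 1.7)])
print(np.linalg.eigvalsh(T).min(), min_eig_hadamard_power(T, 0.5))
```

In this example the matrix `D` coincides with the squared Euclidean distances used to build `G`, which is the log/exp correspondence between the two representation spaces.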

4 Application to Metric Learning 

4.1 Adaptation Using the Matrix-Based Entropy 

By definition, the matrix entropy functional (5) falls into the family of matrix functions known as
spectral functions. These functions depend only on the eigenvalues of the matrix, hence their
name [12]. Using theorem (1.1) from [13] it is straightforward to obtain the derivative of (5) at
$A$ as

$$\nabla S_\alpha(A) = \frac{\alpha}{(1-\alpha)\,\mathrm{tr}(A^\alpha)}\, U \Lambda^{\alpha-1} U^*, \qquad (16)$$

where $A = U \Lambda U^*$ (up to the constant factor $1/\ln 2$ introduced by the base-2 logarithm in
(5)). It is important to note that this decomposition can be used to our advantage. Instead of
computing the full set of eigenvectors and eigenvalues of $A$, we can approximate the gradient of
$S_\alpha$ by using only a few leading eigenvalues. It is easy to see that this approximation will
be optimal in the Frobenius norm $\|X\|_{\mathrm{Fro}} = \sqrt{\mathrm{tr}(X^* X)}$.
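The gradient in (16) can be sanity-checked against central finite differences. The sketch below is an illustration under stated assumptions: it keeps the full eigendecomposition (no truncation to leading eigenvalues), includes the $1/\ln 2$ factor that comes from differentiating the base-2 logarithm, and uses a random well-conditioned test matrix; the helper names are hypothetical.

```python
import numpy as np

def renyi_entropy(A, alpha):
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def renyi_entropy_grad(A, alpha):
    """Gradient of S_alpha at a symmetric positive definite A, following Eq. (16).

    The 1/ln(2) factor accounts for the base-2 logarithm in Eq. (5).
    """
    lam, U = np.linalg.eigh(A)
    scale = alpha / ((1.0 - alpha) * np.log(2.0) * np.sum(lam ** alpha))
    return scale * (U * lam ** (alpha - 1.0)) @ U.T

rng = np.random.default_rng(8)
n, alpha = 5, 2.0
M = rng.normal(size=(n, n))
A = M @ M.T + np.eye(n)
A /= np.trace(A)                       # unit-trace, positive definite

E = rng.normal(size=(n, n))
E = 0.5 * (E + E.T)                    # symmetric perturbation direction
h = 1e-6
finite_diff = (renyi_entropy(A + h * E, alpha) - renyi_entropy(A - h * E, alpha)) / (2 * h)
analytic = np.sum(renyi_entropy_grad(A, alpha) * E)   # <grad, E> = tr(grad E)
print(finite_diff, analytic)
```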

4.2 Metric Learning Using Conditional Entropy 

Here, we apply the proposed matrix framework to the problem of supervised metric learning. This
problem can be formulated as follows. Given a set of points $\{(x_i, l_i)\}_{i=1}^n$, we seek a
positive semidefinite matrix $AA^T$ that parametrizes a Mahalanobis distance between samples
$x, x' \in \mathbb{R}^d$ as $d(x, x') = (x - x')^T AA^T (x - x')$. Our goal is to find a
parametrization matrix $A$ such that the conditional entropy of the labels $l_i$ given the projected
samples $y_i = A^T x_i$, with $y_i \in \mathbb{R}^p$ and $p \ll d$, is minimized. This can be posed as
the following optimization problem:

$$\begin{aligned} \underset{A \in \mathbb{R}^{d \times p}}{\text{minimize}} \quad & S_\alpha(L|Y) \\ \text{subject to} \quad & A^T x_i = y_i, \ \text{for } i = 1, \ldots, n \\ & \mathrm{tr}(A^T A) = p, \end{aligned} \qquad (17)$$

where the trace constraint prevents the solution from growing unbounded. We can translate this
problem to our matrix-based framework in the following way. Let $K$ be the matrix representing the
projected samples

$$K_{ij} = \frac{1}{n}\exp\left(-\frac{(x_i - x_j)^T AA^T (x_i - x_j)}{2\sigma^2}\right)$$

and $L$ be the matrix of class co-occurrences where $L_{ij} = \frac{1}{n}$ if $l_i = l_j$ and zero
otherwise. The conditional entropy can be computed as
$S_\alpha(L|Y) = S_\alpha(nK \circ L) - S_\alpha(K)$, and its gradient at $A$, which can be derived
based on (24), is given by:

$$X^T\left(P - \mathrm{diag}(P\mathbf{1})\right)XA \qquad (18)$$

where $X$ is the $n \times d$ matrix whose rows are the samples $x_i^T$, $\mathbf{1}$ is the vector of
ones, and

$$P = \left(L \circ \nabla S_\alpha(nK \circ L) - \nabla S_\alpha(K)\right) \circ K. \qquad (19)$$

Finally, we can use (18) to search for $A$ iteratively.
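To make the cost function concrete, the following sketch evaluates $S_\alpha(L|Y) = S_\alpha(nK \circ L) - S_\alpha(K)$ for two candidate one-dimensional projections of a toy two-class dataset. The dataset, the kernel size, the function name `conditional_entropy_cost`, and the two projection matrices are all hypothetical; the sketch only evaluates the cost and does not implement the gradient search.

```python
import numpy as np

def renyi_entropy(A, alpha):
    lam = np.clip(np.linalg.eigvalsh(A), 0.0, None)
    return np.log2(np.sum(lam ** alpha)) / (1.0 - alpha)

def conditional_entropy_cost(A_proj, X, labels, alpha=1.01, sigma=1.0):
    """S_alpha(L|Y) for projected samples y_i = A^T x_i (rows of X @ A_proj)."""
    n = X.shape[0]
    Y = X @ A_proj
    sq_dists = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2)) / n
    L = (labels[:, None] == labels[None, :]).astype(float) / n   # class co-occurrences
    return renyi_entropy(n * K * L, alpha) - renyi_entropy(K, alpha)

rng = np.random.default_rng(9)
labels = np.repeat([0, 1], 20)
# First coordinate separates the classes; second coordinate is label-independent noise.
informative = np.where(labels == 0, -5.0, 5.0) + 0.5 * rng.normal(size=labels.size)
noise = rng.normal(size=labels.size)
X = np.column_stack([informative, noise])

A_informative = np.array([[1.0], [0.0]])   # satisfies tr(A^T A) = p = 1
A_noise = np.array([[0.0], [1.0]])
cost_informative = conditional_entropy_cost(A_informative, X, labels)
cost_noise = conditional_entropy_cost(A_noise, X, labels)
print(cost_informative, cost_noise)
```

In this toy setting the projection onto the informative coordinate yields a cost close to zero, since the Gram matrix of the projected samples is already nearly block diagonal with respect to the classes.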

UCI Data: To evaluate the results we use the same experimental setup proposed in [14]. We compare
5 different approaches to supervised metric learning based on the classification error obtained
from two-fold cross-validation using a 4-nearest neighbor classifier. The reported errors are
average errors from 10 runs on the two folds for each algorithm; in our case the parameters are
$p = 3$, $\alpha = 1.01$ and $\sigma = \sqrt{3}$. The feature vectors were centered and scaled to have
unit variance. Figure 2(a) shows the results of the proposed approach, conditional entropy metric
learning (CEML), information theoretic metric learning (ITML) proposed in [14], neighborhood
component analysis (NCA) from [15], the maximally collapsing metric learning (MCML) method from
[16], the large margin nearest neighbor (LMNN) method found in [17], and, as a baseline, the inverse
covariance and Euclidean distances. The results for the Soybean dataset are not reported since there
is more than one possible data set in the UCI repository under that name. The errors obtained by
the metric learning algorithm using the proposed matrix-based entropy framework are consistently
among the best performing methods included in the comparison.

Choice of order $\alpha$: Even though the choice of the entropy order appears to be arbitrary, there
is a motivation for choosing $\alpha$ close to 1. The reason is that the higher the entropy order,
the more prone the algorithm is to find unimodal solutions. This can be advantageous if prior
knowledge or strong assumptions on the class distributions are taken into consideration. In our
experiments, we opted for a lower entropy order to give the algorithm more flexibility in finding a
good solution. To show this phenomenon experimentally, we generated a two-dimensional dataset
containing points from two classes. In one direction the classes are very well separated but the
distribution has multiple modalities. In the orthogonal direction, the classes are not fully
separable, but their distributions are unimodal. Figure 3 shows a sample with points drawn from
both classes; as we can see, projecting the data onto the horizontal axis provides better
separability at the cost of a more complex decision boundary. We ran our metric learning algorithm
60 times for different values of $\alpha$ and recorded the direction of the resulting
one-dimensional feature extractor. Table 1 shows the number of times a particular direction was
picked by our algorithm for different entropy orders. It can be seen that for larger values of
$\alpha$, the algorithm selected the vertical direction more often.

    $\alpha$       1.01    1.3     2      5
    Horizontal      58      35     1      1
    Vertical         2      25    59     59

Table 1. Occurrence of horizontal and vertical solutions versus the entropy order

(b) Projected faces from the UMist dataset and resulting Gram matrix, $\sigma = \sqrt{2}$

Figure 2. Results for the Metric learning application 




Figure 3. Artificial data to illustrate the role of the entropy order 



UMist Faces: We also ran the algorithm on the UMist dataset. This data set consists of grayscale
faces (8 bit [0-255]) of 20 different people. The total number of images is 575 and the size of each
image is 112x92 pixels, for a total of 10304 dimensions. Pixel values were normalized by dividing
by 255 and removing the mean. Figure 2(b) shows the images projected into $\mathbb{R}^2$. It is
remarkable how a linear projection can separate the faces, and it can also be seen from the Gram
matrix that it tries to approximate the co-occurrence matrix $L$.



5 Conclusions 

In this paper, we presented a data-driven framework for information theoretic learning based on 
infinitely divisible matrices. We define estimators of entropy-like quantities that can be computed 
from the Gram matrices obtained by evaluating infinitely divisible kernels on pairs of samples. 
The proposed quantities do not assume that the density of the data has been estimated; this can be
advantageous in many scenarios where even defining a density is not feasible. We discuss some key
properties of the proposed quantities and show how they can be applied to define useful analogues to 
quantities such as conditional entropy. Based on the proposed framework, we introduce a supervised 
metric learning algorithm with results that are competitive with the state of the art. Nevertheless, we 
believe that many interesting formulations to learning problems based on the proposed framework 
are yet to be found. It is also important to highlight that the connection between the RKHS provided 
by the infinitely divisible kernel, and the Hilbertian metrics associated with the negative definite 
functions, opens an interesting avenue to investigate formulations of information theoretic learning 
algorithms on both spaces, and the implications of choosing one or the other. 
A Additional results and proofs

To prove (9) we need to introduce the concept of majorization and some results pertaining to the
ordering that arises from this definition. The proposition is replicated in this appendix for the
sake of self-containment.

Definition A.1 (Majorization): Let $p$ and $q$ be two nonnegative vectors in $\mathbb{R}^n$ such that
$\sum_{i=1}^{n} p_i = \sum_{i=1}^{n} q_i < \infty$. We say $p \preceq q$, $q$ majorizes $p$, if their
respective ordered sequences $p_{[1]} \geq p_{[2]} \geq \cdots \geq p_{[n]}$ and
$q_{[1]} \geq q_{[2]} \geq \cdots \geq q_{[n]}$, denoted by $\{p_{[i]}\}_{i=1}^n$ and
$\{q_{[i]}\}_{i=1}^n$, satisfy:

$$\sum_{i=1}^{k} p_{[i]} \leq \sum_{i=1}^{k} q_{[i]} \quad \text{for } k = 1, \ldots, n. \qquad (20)$$

It can be shown that if $p \preceq q$ then $p = Aq$ for some doubly stochastic matrix $A$ [18]. It is
also easy to verify that if $q \preceq p$ and $h \preceq p$ then $tq + (1-t)h \preceq p$ for
$t \in [0, 1]$. The majorization order is important because it can be associated with the definition
of Schur-concave (convex) functions. A real valued function $f$ on $\mathbb{R}^n$ is called
Schur-convex if $p \preceq q$ implies $f(p) \leq f(q)$, and Schur-concave if $f(q) \leq f(p)$.

Lemma A.1 The function $f_\alpha : \mathcal{S}^n \to \mathbb{R}_+$ ($\mathcal{S}^n$ denotes the
$n$-dimensional simplex), defined as

$$f_\alpha(p) = \frac{1}{1-\alpha}\log_2 \sum_{i=1}^{n} p_i^\alpha, \qquad (21)$$

is Schur-concave for $\alpha > 0$.

Notice that Schur-concavity (Schur-convexity) should not be confused with concavity (convexity) of a
function in the usual sense. Now, we are ready to state the inequality for Hadamard products.
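A small sketch of Definition A.1 and Lemma A.1. The vector `q`, the doubly stochastic matrix (built as a convex mixture of permutation matrices), and the helper names are hypothetical choices for illustration; the sketch checks that $p = Dq$ is majorized by $q$ and that $f_\alpha$ does not decrease when moving from $q$ to $p$.

```python
import numpy as np

def is_majorized_by(p, q, tol=1e-12):
    """True if p is majorized by q: equal totals and ordered partial sums of p <= those of q."""
    p_sorted = np.sort(p)[::-1]
    q_sorted = np.sort(q)[::-1]
    same_total = abs(p_sorted.sum() - q_sorted.sum()) <= tol
    return bool(same_total and np.all(np.cumsum(p_sorted) <= np.cumsum(q_sorted) + tol))

def f_alpha(p, alpha):
    """Renyi entropy of order alpha of a probability vector, as in Eq. (21)."""
    return np.log2(np.sum(p ** alpha)) / (1.0 - alpha)

q = np.array([0.6, 0.3, 0.1, 0.0])
I4 = np.eye(4)
# Convex combination of permutation matrices, hence doubly stochastic.
D = 0.5 * I4 + 0.3 * I4[[1, 0, 3, 2]] + 0.2 * I4[[3, 2, 1, 0]]
p = D @ q                               # p = (0.39, 0.35, 0.11, 0.15)

print(is_majorized_by(p, q), is_majorized_by(q, p))
print([(a, f_alpha(p, a), f_alpha(q, a)) for a in (0.5, 2.0, 5.0)])
```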

Proposition A.1 Let $A$ and $B$ be two $n \times n$ positive definite matrices with trace 1 and
nonnegative entries, and $A_{ii} = \frac{1}{n}$ for $i = 1, 2, \ldots, n$. Then, the following
inequality holds:

$$S_\alpha\left(\frac{A \circ B}{\mathrm{tr}(A \circ B)}\right) \geq S_\alpha(B). \qquad (22)$$

Proof A.1 In proving (22), we will use the fact that $S_\alpha$ preserves the majorization order
(inversely) of nonnegative sequences on the $n$-dimensional simplex. First look at the identity

$$x^T (A \circ B)\,x = \mathrm{tr}(A D_x B D_x),$$

where $D_x = \mathrm{diag}(x)$. In particular, if $\{x_i\}_{i=1}^n$ is an orthonormal basis for
$\mathbb{R}^n$, $\mathrm{tr}(A \circ B) = \sum_{i=1}^{n} x_i^T (A \circ B)\,x_i$. If we let
$\{x_i\}_{i=1}^n$ be the eigenvectors of $A \circ B$ ordered according to their respective
eigenvalues in decreasing order, then

$$\sum_{i=1}^{k} x_i^T (A \circ B)\,x_i = \sum_{i=1}^{k} \mathrm{tr}(A D_{x_i} B D_{x_i}) \leq \frac{1}{n}\sum_{i=1}^{k} \mathrm{tr}(\mathbf{1}\mathbf{1}^T D_{x_i} B D_{x_i}) = \frac{1}{n}\sum_{i=1}^{k} x_i^T B x_i \leq \frac{1}{n}\sum_{i=1}^{k} y_i^T B y_i, \qquad (23)$$

where $k = 1, \ldots, n$ and $\{y_i\}_{i=1}^n$ are the eigenvectors of $B$ ordered according to their
respective eigenvalues in decreasing order. The inequality (23) is equivalent to saying that
$n\lambda(A \circ B) \preceq \lambda(B)$, that is, the sequence of eigenvalues of
$(A \circ B)/\mathrm{tr}(A \circ B)$ is majorized by the sequence of eigenvalues of $B$, which
implies (22) by Lemma A.1.
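The two ingredients of the proof can be checked numerically: the trace identity for the quadratic form of a Hadamard product, and the majorization of the eigenvalues of $n(A \circ B)$ by those of $B$. The Gaussian Gram matrices scaled by $1/n$ (so that $A_{ii} = 1/n$ and both traces equal one) and the helper name are assumptions made for this sketch.

```python
import numpy as np

def normalized_gaussian_gram(Z):
    sq_dists = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / 2.0) / Z.shape[0]

rng = np.random.default_rng(10)
n = 12
A = normalized_gaussian_gram(rng.normal(size=(n, 2)))
B = normalized_gaussian_gram(rng.normal(size=(n, 2)))

# Identity x^T (A o B) x = tr(A D_x B D_x) with D_x = diag(x).
x = rng.normal(size=n)
D_x = np.diag(x)
lhs = x @ (A * B) @ x
rhs = np.trace(A @ D_x @ B @ D_x)

# Majorization: ordered partial sums of eig(n A o B) never exceed those of eig(B).
eig_AB = np.sort(np.linalg.eigvalsh(n * A * B))[::-1]
eig_B = np.sort(np.linalg.eigvalsh(B))[::-1]
partial_gap = np.cumsum(eig_B) - np.cumsum(eig_AB)
print(lhs, rhs, partial_gap.min())
```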

A beautiful observation from Theorem 3.1 is that, according to equation (10), the proposed
normalization procedure for infinitely divisible matrices can be thought of as finding the maximum
entropy matrix among all matrices for which the Hilbert space embeddings are isometrically isomorphic.



A.1 Derivatives of Spectral Functions 



Let $H_n$ denote the vector space of real Hermitian matrices of size $n \times n$ endowed with inner
product $\langle X, Y \rangle = \mathrm{tr}\,XY$, and let $U_n$ denote the set of $n \times n$ unitary
matrices. A real valued function $f$ defined on a subset of $H_n$ is unitarily invariant if
$f(UXU^*) = f(X)$ for any $U \in U_n$. Associated with each spectral function $f$ there is a
symmetric function $F$ on $\mathbb{R}^n$. By symmetric we mean that $F(x) = F(Px)$ for any
$n \times n$ permutation matrix $P$. Let $\lambda(X)$ denote the vector of ordered eigenvalues of
$X$; then, a spectral function $f(X)$ is of the form $F(\lambda(X))$ for a symmetric $F$. We are
interested in the differentiation of the composition $(F \circ \lambda)(\cdot) = F(\lambda(\cdot))$
at $X$². The following result [13] allows us to differentiate a spectral function $f$ at $X$.

Theorem A.1 Let the set $\Omega \subset \mathbb{R}^n$ be open and symmetric, that is, for any
$x \in \Omega$ and any $n \times n$ permutation matrix $P$, $Px \in \Omega$. Suppose that $F$ is
symmetric. Then, the spectral function $F(\lambda(\cdot))$ is differentiable at a matrix $X$ if and
only if $F$ is differentiable at the vector $\lambda(X)$. In this case, the gradient of
$F \circ \lambda$ at $X$ is

$$\nabla(F \circ \lambda)(X) = U\,\mathrm{diag}\left(\nabla F(\lambda(X))\right) U^*, \qquad (24)$$

for any unitary matrix $U$ satisfying $X = U\,\mathrm{diag}(\lambda(X))\,U^*$.
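As a worked illustration of how Theorem A.1 applies to the entropy functional (5), the steps below take the symmetric function $F$ associated with $S_\alpha$ and substitute its partial derivatives into (24). This is a sketch of the calculation rather than a reproduction of the original derivation; the factor $1/\ln 2$ arises from the base-2 logarithm.

```latex
% Symmetric function on the eigenvalues associated with S_alpha in Eq. (5):
F(\lambda) = \frac{1}{1-\alpha}\log_2\Big(\sum_{i=1}^{n}\lambda_i^{\alpha}\Big)

% Partial derivatives with respect to each eigenvalue:
\frac{\partial F}{\partial \lambda_k}
  = \frac{\alpha\,\lambda_k^{\alpha-1}}
         {(1-\alpha)\,\ln 2\,\sum_{i=1}^{n}\lambda_i^{\alpha}}

% Substituting into Eq. (24) with A = U \Lambda U^*:
\nabla S_\alpha(A)
  = U\,\mathrm{diag}\big(\nabla F(\lambda(A))\big)\,U^{*}
  = \frac{\alpha}{(1-\alpha)\,\ln 2\;\mathrm{tr}(A^{\alpha})}\;U\Lambda^{\alpha-1}U^{*}
```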



² Here, $\circ$ denotes composition rather than the Hadamard product.



